Abstract

This paper presents results from the first at- 
tempt to apply Transformation-Based Learn- 
ing to a discourse-level Natural Language 
Processing task. To address two limita- 
tions of the standard algorithm, we developed 
a Monte Carlo version of Transformation- 
Based Learning to make the method 
tractable for a wider range of problems 
without degradation in accuracy, and we 
devised a committee method for assigning 
confidence measures to tags produced by 
Transformation-Based Learning. The pa- 
per describes these advances, presents ex- 
perimental evidence that Transformation- 
Based Learning is as effective as alterna- 
tive approaches (such as Decision Trees 
and N-Grams) for a discourse task called 
Dialogue Act Tagging, and argues that 
Transformation-Based Learning has desirable 
features that make it particularly appealing 
for the Dialogue Act Tagging task. 

1 INTRODUCTION 

Transformation-Based Learning is a relatively new 
machine learning method, which has been as effective
as any other approach on the Part-of-Speech
Tagging problem^1 (Brill, 1995a). We are utilizing
Transformation-Based Learning for another important 
language task called Dialogue Act Tagging, in which 
the goal is to label each utterance in a conversational 
dialogue with the proper dialogue act. A dialogue act 
is a concise abstraction of a speaker's intention, such as 
SUGGEST or ACCEPT. Recognizing dialogue acts is 
critical for discourse-level understanding and can also 

1 The goal of this Natural Language Processing task is 
to label words with the proper part of speech tags, such as 
Noun and Verb. 



be useful for other applications, such as resolving am- 
biguity in speech recognition. But computing dialogue 
acts is a challenging task, because often a dialogue act 
cannot be directly inferred from a literal reading of an 
utterance. Figure 1 presents a hypothetical dialogue
that has been labeled with dialogue acts. 

Our research efforts led us to address some limitations 
of Transformation-Based Learning. We developed a 
Monte Carlo version of the algorithm that overcomes 
the limitation of Transformation-Based Learning's de- 
pendence on manually-generated rule templates and 
enables Transformation-Based Learning to be applied 
effectively to a wider range of tasks. We also devised 
a technique that uses a committee of learned models 
to derive confidence measures associated with the dia- 
logue acts assigned to utterances. 

We experimentally compared our modified version of 
Transformation-Based Learning with C5.0, an imple- 
mentation of Decision Trees, and N-Grams, which was 
previously the best reported method for Dialogue Act 
Tagging (Reithinger and Klesen, 1997). Our system
performs as well as these benchmarks, and we note 
that Transformation-Based Learning has several char- 
acteristics that make it particularly appealing for the 
Dialogue Act Tagging task. 

This paper begins with an overview of the 
Transformation-Based Learning method, describing 
the training phase and the application phase of the al- 
gorithm and presenting some of Transformation-Based 
Learning's most attractive characteristics for Dialogue 
Act Tagging. The following section describes the ex- 
perimental design used for the experiments presented 
in the paper. Then Section 4 presents two limi- 
tations of Transformation-Based Learning, a depen- 
dence on rule templates and a lack of confidence mea- 
sures, and describes our solutions for these problems, 
a Monte Carlo strategy and a committee method. 
Next we present an experimental comparison between 
Transformation-Based Learning, N-Grams, and Deci- 
sion Trees, and conclude with a discussion of this work. 



#   Speaker   Utterance                                       Dialogue Act
1   John      Hello.                                          GREET
2   John      I'd like to meet with you on Tuesday at 2:00.   SUGGEST
3   Mary      That's no good for me,                          REJECT
4   Mary      but I'm free at 3:00.                           SUGGEST
5   John      That sounds fine to me.                         ACCEPT
6   John      I'll see you then.                              BYE



Figure 1: A sample dialogue 



2 TRANSFORMATION-BASED LEARNING

Brill (1995a) developed a symbolic machine learn- 
ing method called Transformation-Based Learning. 
Given a tagged training corpus, Transformation-Based 
Learning produces a sequence of rules that serves as a 
model of the training data. Then, to derive the ap- 
propriate tags, each rule may be applied, in order, 
to each instance in an untagged corpus. For all of 
the results and examples in this paper, we are using 
Transformation-Based Learning on the Dialogue Act 
Tagging task, so the instances are utterances and the 
tags are dialogue acts. In one experiment, our system 
produced a learned model with 213 rules; the first five 
rules are presented in Figure 2.







#   Condition(s)                  New Dialogue Act
1   none                          SUGGEST
2   Includes "see" and "you"      BYE
3   Includes "sounds"             ACCEPT
4   Length < 4 words              GREET
    Previous tag is none^2
5   Includes "no"                 REJECT
    Previous tag is SUGGEST



Figure 2: Rules produced by Transformation-Based 
Learning for Dialogue Act Tagging 



2.1 THE TRAINING PHASE 

The training phase of TBL, in which the system learns 
a sequence of rules based on a tagged training corpus, 
proceeds in the following manner: 

1. Label each instance with a dummy tag.
2. Until no useful rules are found,
   a. For each incorrect tag
      i. Generate all rules that correct the tag.
   b. Score each generated rule.
   c. Output the highest scoring rule.
   d. Apply this rule to the corpus.

2 This condition is true only for the first utterance of a 
dialogue. 



First, the system initializes the training corpus by la- 
beling each instance with a dummy tag. Brill (1995a) 
suggested using a more complex initialization step, but 
we found that this simple strategy is more effective in 
practice.^3 Then the system generates all of the potential
rules that would make at least one tag in the training corpus
correct, under the restrictions described below.
For each potential rule, its improvement score is
defined to be the number of correct tags in the train- 
ing corpus after applying the rule minus the number of 
correct tags in the training corpus before applying the 
rule. The potential rule with the highest improvement 
score is output as the next rule in the final model and 
applied to the entire training corpus. This process re- 
peats (using the updated tags on the training corpus), 
producing one rule for each pass through the training 
corpus until no rule can be found with an improve- 
ment score that surpasses some predefined threshold. 
In practice, threshold values of 1 or 2 appear to be 
effective. 
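The training loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the system used in the experiments: it supports only one hypothetical template ("IF the utterance contains word w THEN change its tag to Y", with w optional), the names `Rule`, `train`, and `candidate_rules` are invented for the example, and ties between equally scored rules are broken alphabetically purely to make the sketch deterministic.

```python
from collections import namedtuple

# Hypothetical rule form for this sketch: "IF the utterance contains `word`
# THEN change its tag to `new_tag`". word=None means "no condition".
Rule = namedtuple("Rule", ["word", "new_tag"])

def apply_rule(rule, corpus, tags):
    """Return the tags that result from applying `rule` to every instance."""
    return [rule.new_tag if (rule.word is None or rule.word in words) else tag
            for words, tag in zip(corpus, tags)]

def count_correct(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))

def candidate_rules(corpus, tags, gold):
    """Generate every rule that would correct at least one incorrect tag."""
    rules = set()
    for words, tag, target in zip(corpus, tags, gold):
        if tag != target:
            rules.add(Rule(None, target))
            rules.update(Rule(w, target) for w in words)
    return rules

def train(corpus, gold, threshold=1):
    """Greedy training: returns the learned sequence of rules.

    corpus is a list of word sets, gold the list of correct tags.
    Training stops when no rule's improvement score surpasses `threshold`.
    """
    tags = ["DUMMY"] * len(corpus)            # step 1: dummy initialization
    model = []
    while True:
        before = count_correct(tags, gold)
        best_rule, best_score = None, threshold
        # sorted() only makes tie-breaking deterministic in this sketch
        for rule in sorted(candidate_rules(corpus, tags, gold),
                           key=lambda r: (r.word or "", r.new_tag)):
            after = count_correct(apply_rule(rule, corpus, tags), gold)
            if after - before > best_score:   # improvement score
                best_rule, best_score = rule, after - before
        if best_rule is None:
            return model
        model.append(best_rule)               # output the best rule ...
        tags = apply_rule(best_rule, corpus, tags)   # ... and apply it
```

On a toy corpus in which SUGGEST is the most frequent tag, the first learned rule is the unconditional "tag everything SUGGEST" rule, and the later rules are narrower corrections, which mirrors the general-to-specific progression of the rule sequence.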

Since there are potentially an infinite number of rules 
that could produce the tags in the training data, it is 
necessary to restrict the range of patterns that the sys- 
tem may consider by providing a set of rule templates, 
such as: 

IF utterance u contains the word(s) w 

AND the tag on the utterance preceding u is X 
THEN change u's tag to Y 

This template can be instantiated to produce the last 
rule in Figure 2 by setting w="no", X=SUGGEST,
and Y=REJECT. 

For the first rules of the learned model, the emphasis 
is on getting as many tags correct as possible with 
no penalty imposed for changing an incorrect tag to 
another incorrect tag. Then for the later rules, the 
system must avoid changing any of the tags that are 



3 This is because Transformation-Based Learning uses 
an error-driven approach, only generating rules for the in- 
stances that are incorrectly labeled. If every instance is 
initialized with a dummy tag, then all of the labels are 
incorrect, and so they all contribute to learning. Alterna- 
tively, using a more involved initialization step results in a 
greater number of correct tags and, effectively, less training 
data. 



already correct. Thus, this method tends to produce 
a sequence of rules that progresses from general rules 
to specific rules. 

2.2 THE APPLICATION PHASE 

To see how a rule sequence can be used to label data, 
consider applying the rules in Figure 2 to the dialogue
in Figure 1. The first rule labels every utterance with
the dialogue act SUGGEST. Next, the second rule
changes an utterance's tag to BYE if it contains the
words "see" and "you", which only holds for utterance
#6. Similarly, the third rule changes utterance #5's 
tag to ACCEPT. Then the fourth rule tags utterance 
#1 as GREET, since its length is 1 and there is no pre- 
ceding utterance in the dialogue. And finally, the last 
rule relabels utterance #3 as REJECT, since utter- 
ance #2 is currently tagged SUGGEST, and the word 
"no" is found in utterance #3. Although the first five 
rules label these six utterances correctly, the remain- 
ing 208 rules in the sequence may continue to adjust 
the tags on the utterances. 
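The walk-through above can be reproduced with a short Python sketch. The rule encoding (a condition function paired with a new tag), the tokenizer, and the choice to update tags in place from the first utterance to the last are assumptions made for illustration; only the five rules and the six utterances come from the figures.

```python
import re

UTTERANCES = [            # the dialogue from Figure 1
    "Hello.",
    "I'd like to meet with you on Tuesday at 2:00.",
    "That's no good for me,",
    "but I'm free at 3:00.",
    "That sounds fine to me.",
    "I'll see you then.",
]

def words(utterance):
    """Lowercase word tokens; this tokenizer is an assumption of the sketch."""
    return re.findall(r"[a-z0-9']+", utterance.lower())

# Each rule is (condition, new_tag). A condition receives the tokens of the
# current utterance and the current tag of the preceding utterance
# (None when there is no preceding utterance in the dialogue).
RULES = [
    (lambda w, prev: True, "SUGGEST"),
    (lambda w, prev: "see" in w and "you" in w, "BYE"),
    (lambda w, prev: "sounds" in w, "ACCEPT"),
    (lambda w, prev: len(w) < 4 and prev is None, "GREET"),
    (lambda w, prev: "no" in w and prev == "SUGGEST", "REJECT"),
]

def apply_rules(utterances, rules):
    tokens = [words(u) for u in utterances]
    tags = ["DUMMY"] * len(utterances)
    for condition, new_tag in rules:          # rules are applied in order
        for i, w in enumerate(tokens):
            prev = tags[i - 1] if i > 0 else None
            if condition(w, prev):
                tags[i] = new_tag             # a later rule may overwrite this
    return tags

tags = apply_rules(UTTERANCES, RULES)
```

After the five rules have been applied, `tags` matches the Dialogue Act column of Figure 1.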

2.3 ATTRACTIVE CHARACTERISTICS 

For the Dialogue Act Tagging task, we selected 
Transformation-Based Learning for several reasons. 
Brill reported that Transformation-Based Learning is 
as good as or better than any other algorithm for the 
Part-of-Speech Tagging problem, labeling 97.2% of the 
words correctly. The part-of-speech tag of a word is 
dependent on the word's internal features and on the 
surrounding words; similarly, the dialogue act of an 
utterance is dependent on the utterance's internal fea- 
tures and on the surrounding utterances. This parallel 
suggests that Transformation-Based Learning has po- 
tential for success on the Dialogue Act Tagging prob- 
lem. 

Since we currently lack a systematic theory of dia- 
logue acts, another reason that Transformation-Based 
Learning is an attractive choice is that its learned 
model consists of relatively intuitive rules (Brill, 
1995a), which a human can analyze to determine what 
the system has learned and develop a working theory. 
Also, Transformation-Based Learning is good at ig- 
noring any potential rules that are irrelevant. This 
is because irrelevant rules tend to have a random ef- 
fect on the training data, which usually results in 
low improvement scores, so these rules are unlikely 
to be selected for inclusion in the final model. This 
is very helpful for Dialogue Act Tagging, since we 
don't know what the relevant templates are for this 
problem. Ramshaw and Marcus (1994) experimen- 
tally demonstrated Transformation-Based Learning's 
robustness with respect to irrelevant rules. 

For these reasons, along with others that are pre- 



sented at the end of the paper, we believe that 
Transformation-Based Learning is worthy of investi- 
gation for the Dialogue Act Tagging task. 

3 EXPERIMENTAL DESIGN 

All of the results presented in this paper followed the 
same experimental design as the third experiment in 
Reithinger and Klesen (1997). The corpus consisted of
appointment-scheduling face-to-face dialogues in En- 
glish, which was divided into a training set with 143 
dialogues (2701 utterances) and a disjoint testing set 
with 20 dialogues (328 utterances). Each utterance 
was manually labeled with one of 18 abstract dia- 
logue acts, such as SUGGEST, ACCEPT, REJECT, 
GREET, and BYE. The full list of dialogue acts is 
found in Reithinger and Klesen (1997). 

The Transformation-Based Learning experiments pre- 
sented in this paper were run on a Sun Ultra 1 ma- 
chine with 508MB of main memory. Within a set of 
experiments, only the specified parameters were var- 
ied, but between sets of experiments many parameters 
may have been varied, so it is not possible to draw 
conclusions across experiment sets. 

Our rule templates consist of all possible combinations 
of a preselected set of conditions. Some of these conditions
are presented in Figure 3. Each condition consists
of a feature and a distance, where the feature
specifies a characteristic of utterances that might be 
relevant for the Dialogue Act Tagging task, and the 
distance specifies the relative position (from the utter- 
ance under analysis) of the utterance that the feature 
should be applied to. 



Feature                 Distance
length         of the   current utterance
tag            of the   preceding utterance
cue patterns   of the   current utterance
speaker        of the   current utterance
speaker        of the   preceding utterance



Figure 3: Some conditions used in our experiments 
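As a rough illustration of how templates arise from conditions, the following Python sketch enumerates every combination of the five conditions in Figure 3. Encoding the distance as a relative offset (0 for the current utterance, -1 for the preceding one) and the name `all_templates` are assumptions made for this example.

```python
from itertools import combinations

# (feature, distance) pairs from Figure 3; the numeric offsets are an
# assumed encoding: 0 = current utterance, -1 = preceding utterance.
CONDITIONS = [
    ("length", 0),
    ("tag", -1),
    ("cue patterns", 0),
    ("speaker", 0),
    ("speaker", -1),
]

def all_templates(conditions):
    """Every subset of the conditions, including the empty template."""
    return [subset
            for k in range(len(conditions) + 1)
            for subset in combinations(conditions, k)]

templates = all_templates(CONDITIONS)
```

With n conditions this produces 2^n templates, so the five conditions listed here already give 32 templates.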



In discourse, it is widely acknowledged that some of 
the short phrases (and specific words) found in an 
utterance provide strong clues to determine the ap- 
propriate dialogue act. Several researchers proposed 
different cue phrases, which are phrases that appear 
frequently in dialogue and convey useful discourse information,
such as "but", "so", and "by the way". Unfortunately,
there is no universal agreement on which
phrases should be considered cue phrases, and in a preliminary
experiment using all of the cue phrases proposed
in the literature,^4 our system's accuracy only

4 These lists of cue phrases can be found in Hirschberg 



improved by 1.03%. 

In order to identify the phrases that will be useful for a 
particular domain, we need an automatic method for 
collecting a set of phrases that is tuned to that do- 
main. So we are using a statistical approach to select 
relevant cue patterns^5 from a training corpus. Assuming
that a phrase is relevant if it co-occurs frequently
with a few specific dialogue acts, we analyze the dis- 
tribution of dialogue acts for utterances that include a 
given phrase, selecting those phrases that correspond 
to dialogue act distributions with low entropy. When 
using these cue patterns, our system's accuracy rose 
by 17.63%. For more details on this work, see Samuel, 
Carberry, and Vijay-Shanker (1998b). 
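The entropy criterion can be illustrated with a small Python sketch. It is not the procedure of Samuel, Carberry, and Vijay-Shanker (1998b): the maximum phrase length, the entropy cutoff `max_entropy`, and the minimum frequency `min_count` are hypothetical parameters chosen only to make the idea concrete.

```python
import math
from collections import Counter, defaultdict

def phrases(tokens, max_n=3):
    """All contiguous phrases of up to max_n words in one utterance."""
    return {" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)}

def entropy(counts):
    """Entropy (in bits) of a distribution given as a Counter of tags."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_cue_patterns(corpus, max_entropy=0.5, min_count=2):
    """corpus: list of (token_list, dialogue_act) pairs.

    Returns the phrases whose dialogue act distribution has low entropy,
    mapped to that entropy. Both thresholds are assumed values.
    """
    acts_by_phrase = defaultdict(Counter)
    for tokens, act in corpus:
        for phrase in phrases(tokens):
            acts_by_phrase[phrase][act] += 1
    return {phrase: entropy(acts)
            for phrase, acts in acts_by_phrase.items()
            if sum(acts.values()) >= min_count
            and entropy(acts) <= max_entropy}
```

A phrase such as "how about" that appears only in SUGGEST utterances has entropy 0 and is kept, while a word spread evenly over several dialogue acts has high entropy and is discarded.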

4 TRANSFORMATION-BASED LEARNING IN DISCOURSE

4.1 TWO LIMITATIONS 

Transformation-Based Learning has two serious limi- 
tations, which we will address in this section. First, 
although Transformation-Based Learning produces a 
tag for each instance, it doesn't offer any measure 
of confidence in these tags. In contrast, probabilistic
machine learning approaches generally label an instance
with a set of tags, each of which is assigned a number
representing the likelihood that it is correct. So
"probabilistic methods ... provide a continuous rank- 
ing of alternative analyses rather than just a single 
output, and such rankings can productively increase 
the bandwidth between components of a modular sys- 
tem." (Brill and Mooney, 1997) 

The second limitation of Transformation-Based Learn- 
ing is that it is highly dependent on the rule templates, 
which are manually developed in advance. Since the 
omission of any relevant templates would handicap the 
system, it is essential that these choices be made care- 
fully. But in Dialogue Act Tagging, no one knows ex- 
actly which conditions and combinations of conditions 
are relevant, so it is preferable to err on the side of cau- 
tion by constructing an overly-general set of templates 
and allowing the system to learn which templates are 
useful. As discussed earlier, Transformation-Based 
Learning is capable of discarding irrelevant rules, so 
this approach should be effective, in theory. 

Unfortunately, this strategy is not tractable, because 
for each pass through the training data, for each in- 
stance that the system has tagged incorrectly, every 
rule template must be instantiated in all possible ways. 

and Litman (1993) and Knott (1996). 

5 In practice, the concept of cue patterns tends to
be more general than cue phrases, including many more 
phrases. 



Suppose that we can postulate $f$ different features that
might be relevant, and we wish to consider these features
for all instances that occur within a distance
$d$ of a given instance. (In other words, we are using
a contextual window of size $2d+1$.) Then there
are $(2d+1)f$ conditions and $2^{(2d+1)f}$ possible templates,
since each condition may either be included or
excluded. Also, suppose that when a feature is applied
to an instance, it produces $v$ distinct values, on average.
This results in $(v+1)^{(2d+1)f}$ rules per instance,
which can be proven by induction on the number of
conditions. Given a training corpus with $i$ instances,
if the algorithm makes $p$ passes through the training
data, then the system must generate and evaluate
$O(ip(v+1)^{(2d+1)f})$ rules. Some realistic values for
these variables are $f=10$, $d=2$ (a contextual window
of size 5), $v=3$, $i=3000$, and $p=100$, which generates
around $10^{35}$ rules. Based on experimental evidence,
it appears that it is necessary to drastically limit the
number of potential rules that the system generates,^6
or the memory and time costs are so exorbitant that
the method becomes intractable. But this limitation
would preclude considering all of the features and feature
interactions that might be relevant for Dialogue
Act Tagging.
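The arithmetic behind this estimate can be checked with a few lines of Python. The function names are invented for the example, and the brute-force enumeration simply treats each condition as either excluded or set to one of v values, which is the counting argument used in the text.

```python
import math
from itertools import product

def rule_space(f, d, v, i, p):
    """Counts from the analysis above for f features, window distance d,
    v values per feature, i instances, and p passes."""
    conditions = (2 * d + 1) * f
    templates = 2 ** conditions
    rules_per_instance = (v + 1) ** conditions
    total_rules = i * p * rules_per_instance
    return conditions, templates, rules_per_instance, total_rules

def enumerate_rules(num_conditions, v):
    """Brute-force check for small cases: each condition is either
    excluded (None) or takes one of v values."""
    return list(product([None, *range(v)], repeat=num_conditions))

conditions, templates, per_instance, total = rule_space(
    f=10, d=2, v=3, i=3000, p=100)
order_of_magnitude = math.floor(math.log10(total))
```

For the values quoted in the text there are 50 conditions, and the total is on the order of 10^35 rules, which agrees with the estimate above.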

4.2 A MONTE CARLO VERSION 

We developed a Monte Carlo version of 
Transformation-Based Learning, so that the sys- 
tem can consider a huge number of templates while 
still maintaining tractability. Rather than exhaus- 
tively searching through the space of possible rules, 
only R of the available template instantiations are 
randomly selected for each training instance on each 
pass through the training data, where R is some small 
integer. With this modification, the total number 
of rules generated is only O(ipR), which no longer 
explodes with the number of templates. In fact, 
the formula doesn't even depend on the number of 
features, the contextual window size, or the value of 
v. But one would still expect good results, because 
Transformation-Based Learning only needs to find the 
best rules, and the best rules tend to be effective for 
a large number of different instances. So the system 
has many opportunities to find these rules, and since 
the algorithm generally makes many passes through 
the training data before halting, if it should select a 
suboptimal rule, it can use later rules to compensate. 
Thus, although random sampling will miss some rules, 
it is still highly likely to find an effective sequence of 
rules. 
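The sampling step might look like the following Python sketch. The text specifies only that R template instantiations are selected at random for each training instance; drawing templates uniformly (each condition included with probability 1/2), the data layout of `instance_values`, and the function name `sample_rules` are assumptions made for illustration.

```python
import random

def sample_rules(instance_values, correct_tag, R, rng):
    """Randomly instantiate R templates for one incorrectly tagged instance.

    instance_values maps each condition name to the list of values that
    hold for this instance (a set-valued feature contributes several).
    Each returned rule is (conditions, new_tag).
    """
    names = sorted(instance_values)
    rules = []
    for _ in range(R):
        # A template is a subset of the conditions. Including each condition
        # with probability 1/2 draws one of the 2^n templates uniformly
        # (an assumption of this sketch).
        chosen = [name for name in names if rng.random() < 0.5]
        # Instantiate the template with values observed on this instance,
        # so the rule is guaranteed to apply to it.
        conds = tuple((name, rng.choice(instance_values[name]))
                      for name in chosen)
        rules.append((conds, correct_tag))
    return rules   # duplicates are possible; no check is made for repetitions
```

Because exactly R rules are produced per instance regardless of how many conditions exist, the cost per pass grows with the number of instances rather than with the number of templates.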

Our experiments confirm these intuitions, as shown
in Figures 4 and 5. For these runs, eight conditions
were preselected, and for different values of $n$ up to 8,
the first $n$ conditions were combined in all
possible ways to generate $2^n$ templates. Using these
templates, we trained, tested, and compared the standard
Transformation-Based Learning method and our
Monte Carlo version of Transformation-Based Learning.

6 For the Part-of-Speech Tagging task, Brill used only
about 30 simple rule templates (Brill, 1995a).

Figure 4: Number of conditions vs. training time
(x-axis: number of conditions, 0 to 8; curves: Standard TBL, and Monte Carlo TBL with R=16, R=6, and R=1)

For the standard Transformation-Based Learning 
method, training time rises dramatically as the number
of conditions increases, as shown in Figure 4.^7
In fact, when given seven conditions, the standard 
Transformation-Based Learning algorithm could not 
complete the training phase, even after running for 
more than 24 hours. But our Monte Carlo version 
of Transformation-Based Learning keeps the efficiency 
relatively stable.^8 The reason for the slight increase in
training time as the number of conditions increases is 



7 The value of v (the average number of rules generated 
per instance) varies slightly across the eight conditions, 
and so the shape of the curve might vary depending on 
the order in which the conditions are presented. But the 
critical point is that the training time rises exponentially 
with the number of conditions. 

8 The Monte Carlo version of Transformation-Based
Learning can be slower than the standard method, because 
the Monte Carlo version always generates R rules for each 
instance, without checking for repetitions. (It would be too 
inefficient to prevent the system from generating any rule 
more than once.) 



that, as the system gains access to a greater number 
of useful conditions, it's likely to find a greater num- 
ber of useful rules, meaning that the training phase 
makes a greater number of passes through the train- 
ing data. Thus, p increases, and so the training time, 
O(ipR), also increases. But this increase is linear (or 
less), while standard Transformation-Based Learning's 
training time increases exponentially with the number 
of conditions. Figure 4 supports this analysis.

This improvement in time efficiency would be quite uninteresting
if the performance of the algorithm deteriorated
significantly. But, as Figure 5 shows, this is not
the case. Although setting R too low (such as R=1 for
7 and 8 conditions) may result in a decrease in accuracy,
the lowest possible setting (R=1) is as accurate
as standard Transformation-Based Learning for 6 conditions
(64 templates). For 7 and 8 conditions, training
of the standard Transformation-Based Learning
method took too much time, so those results could not
be produced. But, as the curves for R=6 and R=16 do
not differ significantly, it is reasonable to predict that
standard Transformation-Based Learning would produce
similar results as well.^9 Therefore, we conclude

9 One might wonder how the Monte Carlo version of 
Transformation-Based Learning can ever do better than 
the standard Transformation-Based Learning method, 
which occurred for the experiments that used five con- 
ditions. Because Transformation-Based Learning is a 
greedy algorithm, choosing the best available rule on each 




Figure 5: Number of conditions vs. tagging accuracy on unseen data
(x-axis: number of conditions, 0 to 8; curves: Standard TBL, and Monte Carlo TBL with R=16, R=6, and R=1)



that our Monte Carlo version of Transformation-Based 
Learning (with R=6) works effectively for more than 
250 templates (8 conditions) in only about 15 minutes 
of training time. 

4.3 A COMMITTEE METHOD 

We wanted to extend Transformation-Based Learning 
so that it could provide some idea of the likelihood 
that each of its tags is correct. So we attempted to
develop a strategy for assigning confidence measures 
to the rules in the learned model. Then, in the ap- 
plication phase, a given instance's confidence measure 
would be a function of the confidences of the rules that 
applied to that instance. Unfortunately, due to the na- 
ture of the Transformation-Based Learning method, 
this straightforward approach has been unsuccessful, 
because the rule sequence does not contain enough 
information to derive confidence measures; often, the 
same pattern of rules applies to instances that should 
be marked with high confidence as well as instances 
that should be marked with low confidence. 

So, for the purpose of computing confidence measures, 
we adapted two techniques that were developed for 
very different tasks. The Boosting approach has been 
used to improve accuracy in tagging data (Freund and 
Schapire, 1996), and Committee-Based Sampling uti- 
lized a very similar strategy to minimize the required 

pass through the training data, sometimes the standard 
Transformation-Based Learning method selects a rule that 
locks it into a local maximum, while the Monte Carlo ver- 
sion might fail to consider this attractive rule and end up 
producing a better model. 



size of a training corpus (Dagan and Engelson, 1995). 
We applied these methods to compute confidence mea- 
sures, by training the system a number of times to 
produce a few different but reasonable learned models, 
which are called committee members. Then given new 
data, each committee member independently tags the 
input, and a given tag's confidence is based on how 
well the committee members agree on that tag. We 
are currently defining the confidence of a given tag to 
be the number of committee members that preferred 
the tag. In the future, we will investigate confidence 
formulas that are based on the entropy of the tags se- 
lected by the different committee members. 

We considered several ways to develop the committee 
members, and we decided to apply the strategy that 
Freund and Schapire (1996) used for Boosting: The 
first committee member is trained in the standard way, 
and then the second committee member pays special 
attention to those instances in the training data that 
the first committee member did not tag correctly. To 
do this in Transformation-Based Learning, we adjust 
the improvement score formula to weight success on 
these "hard" instances more heavily. (In effect, it is 
as if we were adding multiple copies of these instances 
to the training corpus.) This process can be repeated 
to generate more committee members by basing the 
score for correctly tagging a training instance on the 
number of previous committee members that tagged 
that instance incorrectly. We are currently using $2^c$
as the score for correctly tagging a given instance that 
c committee members have mistagged. This strategy 
tends to produce committee members that are very 
different, as they are focusing on different parts of the 



training corpus. 
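One way to read this weighting scheme is sketched below in Python. The function names and the data layout (one list of output tags per earlier committee member) are assumptions; the only detail taken from the text is that correctly tagging an instance scores 2 to the power c, where c is the number of earlier committee members that mistagged it.

```python
def instance_weights(gold, member_outputs):
    """Weight 2**c for each training instance, where c is the number of
    earlier committee members that tagged it incorrectly.

    member_outputs holds one list of tags per earlier committee member.
    """
    weights = []
    for idx, target in enumerate(gold):
        c = sum(output[idx] != target for output in member_outputs)
        weights.append(2 ** c)
    return weights

def weighted_improvement(before, after, gold, weights):
    """Weighted count of correct tags after applying a rule, minus the
    weighted count before applying it."""
    def score(tags):
        return sum(w for t, g, w in zip(tags, gold, weights) if t == g)
    return score(after) - score(before)
```

With no earlier members every weight is 1, so the score reduces to the ordinary improvement score used for the first committee member.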



Minimum      Percentage of       Average
Confidence   Instances Tagged    Precision
5            45.12% ± 1.28%      90.09% ± 1.51%
4            69.79% ± 1.60%      83.53% ± 1.27%
3            92.38% ± 1.32%      76.57% ± 0.79%
2            99.85% ± 0.20%      73.56% ± 1.10%
1            100.00% ± 0.00%     73.45% ± 1.06%



Figure 6: Testing the committee method on unseen 
data, varying the minimum confidence considered 



As a preliminary experiment we ran ten trials with five 
committee members, testing on held-out data. Figure 6
presents average scores and standard deviations,
varying the minimum confidence, m. For a given in- 
stance, if at least m committee members agreed on 
a tag, then the most popular tag was applied, break- 
ing ties in favor of the committee member that was 
developed the earliest; otherwise no tag was output. 
The results show that the committee approach as- 
signs useful confidence measures to the tags: All five 
committee members agreed on the tags for 45.12% of 
the instances, and 90.09% of those tags were correct. 
Also, for 69.79% of the instances, at least four of the 
five committee members selected the same tag, and 
this tag was correct 83.53% of the time. We foresee 
that our module for tagging dialogue acts can poten- 
tially be integrated into a larger system so that, when 
Transformation-Based Learning cannot produce a tag 
with high confidence, other modules may be invoked 
to provide more evidence. In addition, like Boost- 
ing, the committee method improves the overall ac- 
curacy of the system. By selecting the most popular 
tag among all five committee members, the average ac- 
curacy in tagging unseen data was 73.45%, while using 
the first committee member alone resulted in a significantly
(t = 5.42 > 2.88, α = 0.01) lower average score
of 70.79%. 
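The voting scheme used in this experiment can be sketched as follows. The function name `committee_tag` and the return convention (a tag of `None` when too few members agree) are assumptions made for illustration.

```python
from collections import Counter

def committee_tag(votes, min_confidence):
    """Combine the tags chosen by the committee members for one instance.

    votes lists the members' tags, earliest-developed member first.
    Returns (tag, confidence); tag is None when fewer than
    min_confidence members agree on the most popular tag.
    """
    counts = Counter(votes)
    confidence = max(counts.values())   # members preferring the top tag
    if confidence < min_confidence:
        return None, confidence
    # Tie-break: among the most popular tags, prefer the one chosen by
    # the earliest committee member.
    tag = next(v for v in votes if counts[v] == confidence)
    return tag, confidence
```

For example, with votes ACCEPT, BYE, BYE, ACCEPT, GREET and a minimum confidence of 2, ACCEPT is output with confidence 2, because it ties with BYE and was chosen by the earliest member; with a minimum confidence of 3, no tag is output.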

4.4 ALTERNATIVE METHODS 

Previously, the best success rate achieved on the Dia- 
logue Act Tagging problem was reported by Reithinger 
and Klesen (1997), whose system used a probabilistic 
machine learning approach based on N-Grams to cor- 
rectly label 74.7% of the utterances in a test corpus. 
(See Samuel, Carberry, and Vijay-Shanker (1998a) for 
a more extensive analysis of previous work on this 
task.) As a direct comparison, we applied our system 
to exactly the same training and testing set. Over 
five runs, the system achieved an average^10 accuracy
of 75.12% ± 1.34%, including a high score^11 of 77.44%.

10 The variation in the scores is due to the random nature
of the Monte Carlo method. 

11 The rules in Figure 2 were produced in this experiment.



In addition, we ran a direct comparison between 
Transformation-Based Learning and C5.0 (Rulequest 
Research, 1998), which is an implementation of the 
Decision Trees method. The accuracies on held-out 
data for training sets of various sizes are presented 
in Figure 7. For Transformation-Based Learning, we
averaged the scores of ten trials for each training set 
(to factor out the random effects of the Monte Carlo 
method), and the standard deviations are represented
by error bars in the graph. These experiments did not 
utilize the committee method, and we would expect 
the scores to improve when this extension is used. 

With C5.0, we wanted to use the same features that 
were effective for Transformation-Based Learning, but 
we encountered two problems: 1) Since C5.0 requires 
that each feature take exactly one value for each in- 
stance, it is very difficult to utilize the cue patterns 
feature. We decided to provide one boolean feature 
for each possible cue pattern, which was set to True 
for instances that included that cue pattern and False 
otherwise. 2) Our Transformation-Based Learning system
utilized the system-generated tags^12 of the preceding
instance. C5.0 cannot use this information, as it
requires that the values of all of the features are com- 
puted before training begins. 

The training times of Transformation-Based Learning 
and C5.0 were relatively comparable for any number 
of conditions, although Boosting sometimes resulted 
in a significant increase in training time. The ac- 
curacy scores of Transformation-Based Learning and 
C5.0, with and without Boosting, are not significantly 
different, as shown in Figure 7.

5 DISCUSSION 

This paper has described the first investigation of 
Transformation-Based Learning applied to discourse- 
level problems. We extended the algorithm to ad- 
dress two limitations of Transformation-Based Learn- 
ing: 1) We developed a Monte Carlo version of 
Transformation-Based Learning, and our experiments 
suggest that this improvement dramatically increases 
the efficiency of the method without compromising ac- 
curacy. This revision enables Transformation-Based 
Learning to work effectively on a wider variety of tasks, 
including tasks where the relevant conditions and con- 
dition combinations are not known in advance as well 
as tasks where there are a large number of relevant 
conditions and condition combinations. This improve- 
ment also decreases the labor demands on the human 
developer, who no longer needs to construct a mini- 

12 For Transformation-Based Learning, the tags change 
as the system applies the rules in the learned model. When 
a rule references a tag, it uses the value of the tag at the 
point when that rule is processed. 




Figure 7: Training set size vs. tagging accuracy on unseen data 



mal set of rule templates. It is sufficient to list all of 
the conditions that might be relevant and allow the 
system to consider all possible combinations of those 
conditions. 2) We devised a committee strategy for 
computing confidence measures to represent the reli- 
ability of tags. In our experiments, this committee 
method improved the overall tagging accuracy signif- 
icantly. It also produced useful confidence measures; 
nearly half of the tags were assigned high confidence, 
and of these, 90% were correct. 

For the Dialogue Act Tagging task, our modified ver- 
sion of Transformation-Based Learning has achieved 
an accuracy rate that is comparable to any previously 
reported system. In addition, Transformation-Based 
Learning has a number of features that make it par- 
ticularly appealing for the Dialogue Act Tagging task: 

1. Transformation-Based Learning's learned model 
consists of a relatively short sequence of intuitive 
rules, stressing relevant features and highlight- 
ing important relationships between features and 
tags (Brill, 1995a). Thus, Transformation-Based 
Learning's learned model offers insights into a the- 
ory to explain the training data. This is especially 
useful in Dialogue Act Tagging, which currently 
lacks a systematic theory. 

2. With its iterative training algorithm, when devel- 
oping a new rule, Transformation-Based Learning 
can consider tags that have been produced by pre- 
vious rules (Ramshaw and Marcus, 1994). Since 
the dialogue act of an utterance is affected by the 
surrounding dialogue acts, this leveraged learn- 
ing approach can directly integrate the relevant 



contextual information into the rules. In addi- 
tion, Transformation-Based Learning can accom- 
modate the focus shifts that frequently occur in 
discourse by utilizing features that consider tags 
of varying distances. 

3. Our Transformation-Based Learning system is 
very flexible with respect to the types of features 
it can utilize. For example, it can learn set-valued 
features, such as cue patterns. Additionally, be- 
cause of the Monte Carlo improvement, our sys- 
tem can handle a very large number of features. 

4. For the Dialogue Act Tagging task, people still 
don't know what features are relevant, so it is very 
difficult to construct an appropriate set of rule 
templates. Fortunately, Transformation-Based 
Learning is capable of discarding irrelevant rules, 
as Ramshaw and Marcus (1994) showed exper- 
imentally, so it is not necessary that all of the 
given rule templates be useful. 

5. Ramshaw and Marcus's (1994) experiments sug- 
gest that Transformation-Based Learning tends to 
be resistant to the overfitting^13 problem. This can
be explained by observing how the rule sequence 
produced by Transformation-Based Learning pro- 
gresses from general rules to specific rules. The 
early rules in the sequence are based on many ex- 
amples in the training corpus, and so they are 
likely to generalize effectively to new data. Later 
in the sequence, the rules don't receive as much 

13 Other machine learning algorithms may overfit to the 
training data and then have difficulty generalizing to new 
data. 



support from the training data, and their applica- 
bility conditions tend to be very specific, so they 
have little or no effect on new data. Thus, resis- 
tance to overfitting is an emergent property of the 
Transformation-Based Learning algorithm. 

For the future, we intend to investigate a wider variety 
of features and explore different methods for collecting 
cue patterns to increase our system's accuracy scores 
further. Although we compared Transformation- 
Based Learning with a few very different machine 
learning algorithms, we still hope to examine other 
methods, such as Naive Bayes. In addition, we plan 
to run our experiments with different corpora to con- 
firm that the encouraging results of our extensions to 
Transformation-Based Learning can be generalized to 
different data, languages, domains, and tasks. We 
would also like to extend our system so that it may 
learn from untagged data, as there is still very little 
tagged data available in discourse. Brill developed an 
unsupervised version of Transformation-Based Learn- 
ing for Part-of-Speech Tagging (Brill, 1995b), but this 
algorithm must be initialized with instances that can 
be tagged unambiguously (such as "the", which is always
a determiner), and in Dialogue Act Tagging there
are very few unambiguous examples. We intend to 
investigate the following weakly-supervised approach: 
First, the system will be trained on a small set of 
tagged data to produce a number of different com- 
mittee members. Then given untagged data, it will 
derive tags with confidence measures. Those tags that 
receive very high confidence can be used as unam- 
biguous examples to drive the unsupervised version of 
Transformation-Based Learning. 